We are going to use TMDb data; this data set contains informationabout 10,000 movies collected fromThe Movie Database (TMDb), including user ratings and revenue.
- Certain columns, like ‘cast’and ‘genres’, contain multiplevalues separated by pipe (|) characters.
- There are some odd charactersin the ‘cast’ column. We are going to clean them.
- The final two columns ending with “_adj” show the budget and revenue of the associated movie in terms of 2015 dollars,accounting for inflation overtime.
We will download the data in csv format to be used as a data frame with the pandas package and we will proceed to select the columns that will be significant for our analysis, proceeding to clean the data and then explore the data trying to give some questions and make a summary in the conclusion section.
We are going to anwser the follow questions:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import json
# Load your data and print out a few lines. Perform operations to inspect data
# types and look for instances of missing or possibly errant data.
movies = pd.read_csv('tmdb-movies.csv')
movies.head(2)
| id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | ... | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | ... | Twenty-two years after the events of Jurassic ... | 124 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 6/9/15 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 |
| 1 | 76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | http://www.madmaxmovie.com/ | George Miller | What a Lovely Day. | ... | An apocalyptic story set in the furthest reach... | 120 | Action|Adventure|Science Fiction|Thriller | Village Roadshow Pictures|Kennedy Miller Produ... | 5/13/15 | 6185 | 7.1 | 2015 | 1.379999e+08 | 3.481613e+08 |
2 rows × 21 columns
print(movies.shape)
(10866, 21)
The data Frame has 10,866 rows and 21 columns.
movies.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10866 entries, 0 to 10865 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 10866 non-null int64 1 imdb_id 10856 non-null object 2 popularity 10866 non-null float64 3 budget 10866 non-null int64 4 revenue 10866 non-null int64 5 original_title 10866 non-null object 6 cast 10790 non-null object 7 homepage 2936 non-null object 8 director 10822 non-null object 9 tagline 8042 non-null object 10 keywords 9373 non-null object 11 overview 10862 non-null object 12 runtime 10866 non-null int64 13 genres 10843 non-null object 14 production_companies 9836 non-null object 15 release_date 10866 non-null object 16 vote_count 10866 non-null int64 17 vote_average 10866 non-null float64 18 release_year 10866 non-null int64 19 budget_adj 10866 non-null float64 20 revenue_adj 10866 non-null float64 dtypes: float64(4), int64(6), object(11) memory usage: 1.7+ MB
We see that there are varibles with na's values that we are going to clean them
We are going to work with the budget_adj and revenue_adj which are adjusted for inflation. The mean of those value is a big numbers, so we are going to pass to a million this columns to make this number more easy to read and we'll create profit column (revenue_adj - budget_adj)
movies['budget_adj'] = movies['budget_adj'].div(1000000).round(2)
movies['revenue_adj'] = movies['revenue_adj'].div(1000000).round(2)
movies['profit'] = movies['revenue_adj'] - movies['budget_adj']
movies.describe()
| id | popularity | budget | revenue | runtime | vote_count | vote_average | release_year | budget_adj | revenue_adj | profit | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10866.000000 | 10866.000000 | 1.086600e+04 | 1.086600e+04 | 10866.000000 | 10866.000000 | 10866.000000 | 10866.000000 | 10866.000000 | 10866.000000 | 10866.000000 |
| mean | 66064.177434 | 0.646441 | 1.462570e+07 | 3.982332e+07 | 102.070863 | 217.389748 | 5.974922 | 2001.322658 | 17.551067 | 51.364352 | 33.813285 |
| std | 92130.136561 | 1.000185 | 3.091321e+07 | 1.170035e+08 | 31.381405 | 575.619058 | 0.935142 | 12.812941 | 34.306149 | 144.632437 | 125.214988 |
| min | 5.000000 | 0.000065 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 10.000000 | 1.500000 | 1960.000000 | 0.000000 | 0.000000 | -413.910000 |
| 25% | 10596.250000 | 0.207583 | 0.000000e+00 | 0.000000e+00 | 90.000000 | 17.000000 | 5.400000 | 1995.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 20669.000000 | 0.383856 | 0.000000e+00 | 0.000000e+00 | 99.000000 | 38.000000 | 6.000000 | 2006.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 75610.000000 | 0.713817 | 1.500000e+07 | 2.400000e+07 | 111.000000 | 145.750000 | 6.600000 | 2011.000000 | 20.850000 | 33.695000 | 12.925000 |
| max | 417859.000000 | 32.985763 | 4.250000e+08 | 2.781506e+09 | 900.000000 | 9767.000000 | 9.200000 | 2015.000000 | 425.000000 | 2827.120000 | 2750.130000 |
print(movies.duplicated().sum())
1
There is one duplicate row
# Check is all id movies are unique
print(movies['id'].nunique())
print(movies.shape)
10865 (10866, 22)
The number of id value is one less than the number of row so there is one duplicated id movies. Maybe there can be more movies with the same title, but in any case they would be different movies.We are gonna drop the duplcated row
# Check if trere ara NA's in the data
movies.isna().sum()
id 0 imdb_id 10 popularity 0 budget 0 revenue 0 original_title 0 cast 76 homepage 7930 director 44 tagline 2824 keywords 1493 overview 4 runtime 0 genres 23 production_companies 1030 release_date 0 vote_count 0 vote_average 0 release_year 0 budget_adj 0 revenue_adj 0 profit 0 dtype: int64
There are columns with hight numbers of na's, so we have to clean them.
We are going to remove some columns of movie data that are not significant for our analysis. We will inspect the na's that are still in the movie data to see which method will be more appropriate to clean up the na's, substituting the average of the same column or perhaps removing the specific row. We'll check for duplicate values to remove them, create a new column if needed, and rename columns if needed.
# drop duplicated rows
movies.drop_duplicates(inplace = True)
print(movies.duplicated().sum())
0
Those variables are no significant for our analisis.
movies.drop(['id', 'imdb_id', 'runtime', 'imdb_id', 'homepage', 'tagline', 'keywords', 'overview', 'budget', 'revenue'],
axis=1, inplace = True)
#Let's see if there are budgets with a value of cero or less of that
print(movies[movies['budget_adj']==0].shape)
print(movies[movies['revenue_adj']==0].shape)
(5758, 13) (6086, 13)
movies['budget_adj'] =movies['budget_adj'].replace(0,np.nan)
movies['revenue_adj'] =movies['revenue_adj'].replace(0,np.nan)
movies.revenue_adj.fillna(movies.revenue_adj.mean(), inplace=True)
movies.budget_adj.fillna(movies.budget_adj.mean(), inplace=True)
movies.isna().sum()
popularity 0 original_title 0 cast 76 director 44 genres 23 production_companies 1030 release_date 0 vote_count 0 vote_average 0 release_year 0 budget_adj 0 revenue_adj 0 profit 0 dtype: int64
About 10% of data remains with na's in production_companies but, I want to use this column, so we are going to drop those na's.
movies.dropna(inplace=True)
# check for NAN's
movies.isna().sum()
popularity 0 original_title 0 cast 0 director 0 genres 0 production_companies 0 release_date 0 vote_count 0 vote_average 0 release_year 0 budget_adj 0 revenue_adj 0 profit 0 dtype: int64
movies.shape
(9772, 13)
Now there are not na's in the data, so the data has 9772 rows and 13 columns
we convert that column to be able to order by date.
# Convert string carhater to data time
movies['release_date']= pd.to_datetime(movies['release_date'])
# check release_date
movies[['release_date']].info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 9772 entries, 0 to 10865 Data columns (total 1 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 release_date 9772 non-null datetime64[ns] dtypes: datetime64[ns](1) memory usage: 152.7 KB
Now, the format in the column is datatime, so we can order and operate with that
# check the columns in movies
movies.columns
Index(['popularity', 'original_title', 'cast', 'director', 'genres',
'production_companies', 'release_date', 'vote_count', 'vote_average',
'release_year', 'budget_adj', 'revenue_adj', 'profit'],
dtype='object')
Now, We have a clean data frame, so let's look to explore the data
*Now that we've trimmed and cleaned our data, I'm ready to move on to exploration. We are going to compute statistics and create some visualizations with the goal to answer some questions. It is recommended that you be systematic with your approach. Look at one variable at a time, and then follow it up by looking at relationships between variables.
Before we answer the question we are going to do some exploratory analisys to give us and idea about the data.
movies.describe()
| popularity | vote_count | vote_average | release_year | budget_adj | revenue_adj | profit | |
|---|---|---|---|---|---|---|---|
| count | 9772.000000 | 9772.000000 | 9772.000000 | 9772.000000 | 9772.000000 | 9772.000000 | 9772.000000 |
| mean | 0.694721 | 239.312014 | 5.963528 | 2000.878428 | 37.771199 | 117.836706 | 37.637055 |
| std | 1.036931 | 603.011504 | 0.913174 | 13.036794 | 30.127970 | 139.301798 | 131.475516 |
| min | 0.000188 | 10.000000 | 1.500000 | 1960.000000 | 0.010000 | 0.010000 | -413.910000 |
| 25% | 0.232710 | 18.000000 | 5.400000 | 1994.000000 | 23.400000 | 52.217500 | 0.000000 |
| 50% | 0.419762 | 46.000000 | 6.000000 | 2005.000000 | 37.336967 | 116.786792 | 0.000000 |
| 75% | 0.776408 | 173.000000 | 6.600000 | 2011.000000 | 37.336967 | 116.786792 | 20.290000 |
| max | 32.985763 | 9767.000000 | 8.700000 | 2015.000000 | 425.000000 | 2827.120000 | 2750.130000 |
movies.corr()
<ipython-input-75-0119fb4bb284>:1: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning. movies.corr()
| popularity | vote_count | vote_average | release_year | budget_adj | revenue_adj | profit | |
|---|---|---|---|---|---|---|---|
| popularity | 1.000000 | 0.802956 | 0.239121 | 0.110944 | 0.399447 | 0.513575 | 0.561496 |
| vote_count | 0.802956 | 1.000000 | 0.279851 | 0.124726 | 0.494277 | 0.625030 | 0.653544 |
| vote_average | 0.239121 | 0.279851 | 1.000000 | -0.140046 | 0.071707 | 0.148307 | 0.217499 |
| release_year | 0.110944 | 0.124726 | -0.140046 | 1.000000 | 0.030412 | -0.058876 | -0.074652 |
| budget_adj | 0.399447 | 0.494277 | 0.071707 | 0.030412 | 1.000000 | 0.525413 | 0.407245 |
| revenue_adj | 0.513575 | 0.625030 | 0.148307 | -0.058876 | 0.525413 | 1.000000 | 0.925392 |
| profit | 0.561496 | 0.653544 | 0.217499 | -0.074652 | 0.407245 | 0.925392 | 1.000000 |
# Histograms of all the numeric variables
movies.hist(figsize=(20,15));
sns.pairplot(movies);
We are goin to make a plot to try to anwser this question.
sns.set_context("paper", rc={"font.size":15,"axes.titlesize":20,"axes.labelsize":15})
g = sns.regplot(x="popularity", y="revenue_adj", data= movies, scatter_kws={'alpha':0.4})
g.set_xlabel('Popularity')
g.set_ylabel('Adjusted Revenue')
g.set(title='Popularity vs Adjusted Revenue')
plt.gcf().set_size_inches(10, 7);
sns.set_context("paper", rc={"font.size":15,"axes.titlesize":20,"axes.labelsize":15, 'legend.fontsize':15})
g = sns.relplot(x="popularity", y="profit", size="budget_adj",
sizes=(10, 300),alpha =0.7, data=movies[movies['profit'] >0])
g.set_xlabels('Popularity')
g.set_ylabels('Profit')
#g.set(title = 'Profit by Movie vs Number of Movies and represent by profit' )
plt.text(x= 5, y=3200, s='Popularity vs Profit by Movie', fontsize=16, weight='bold')
plt.text(x= 5, y=3000, s='The size of each point corresponds to the adjusted budget in millions', fontsize=12, alpha=0.75)
plt.gcf().set_size_inches(10, 7);
We are goin to make a plot to try to anwser this question. Let's explore the relation between budget_adj and popularity Making a plot to try to get it
sns.set_context("paper", rc={"font.size":15,"axes.titlesize":20,"axes.labelsize":15})
g = sns.regplot(x="budget_adj", y="popularity", data= movies, scatter_kws={'alpha':0.4})
g.set_xlabel('Adjusted Budget')
g.set_ylabel('popularity')
g.set(title='Adjusted Budget vs Popularity')
plt.gcf().set_size_inches(10, 7);
movies[['original_title', 'vote_average', 'vote_count']].sort_values(by='vote_count', ascending = False).head(20)
| original_title | vote_average | vote_count | |
|---|---|---|---|
| 1919 | Inception | 7.9 | 9767 |
| 4361 | The Avengers | 7.3 | 8903 |
| 1386 | Avatar | 7.1 | 8458 |
| 2875 | The Dark Knight | 8.1 | 8432 |
| 4364 | Django Unchained | 7.7 | 7375 |
| 4382 | The Hunger Games | 6.7 | 7080 |
| 5425 | Iron Man 3 | 6.9 | 6882 |
| 4363 | The Dark Knight Rises | 7.5 | 6723 |
| 629 | Interstellar | 8.0 | 6498 |
| 4367 | The Hobbit: An Unexpected Journey | 6.9 | 6417 |
| 2410 | The Matrix | 7.8 | 6351 |
| 2877 | Iron Man | 7.3 | 6220 |
| 1 | Mad Max: Fury Road | 7.1 | 6185 |
| 4365 | Skyfall | 6.8 | 6137 |
| 2633 | The Lord of the Rings: The Fellowship of the Ring | 7.8 | 6079 |
| 2409 | Fight Club | 8.1 | 5923 |
| 4178 | The Shawshank Redemption | 8.4 | 5754 |
| 4949 | The Lord of the Rings: The Return of the King | 7.9 | 5636 |
| 630 | Guardians of the Galaxy | 7.9 | 5612 |
| 0 | Jurassic World | 6.5 | 5562 |
sns.set_context("paper", rc={"font.size":15,"axes.titlesize":20,"axes.labelsize":15})
sns.set_style("darkgrid")
g =sns.catplot(x="vote_count", y="original_title", kind = 'bar',color= '#abc9ea',
data=movies[['original_title', 'vote_average', 'vote_count']].sort_values(by='vote_count', ascending = False).reset_index().head(20))
g.set_axis_labels( 'Total Votes', 'Movie')
#g.set_xticklabels(rotation=75)
g.set(title='Top 20 Movies (more votes)')
#g.bar_label(g.containers[0])
plt.gcf().set_size_inches(30, 10)
movies[['original_title', 'profit']].sort_values(by='profit', ascending = False).head(20)
| original_title | profit | |
|---|---|---|
| 1329 | Star Wars | 2750.13 |
| 1386 | Avatar | 2586.23 |
| 5231 | Titanic | 2234.72 |
| 10594 | The Exorcist | 2128.03 |
| 9806 | Jaws | 1878.65 |
| 8889 | E.T. the Extra-Terrestrial | 1767.96 |
| 3 | Star Wars: The Force Awakens | 1718.72 |
| 8094 | The Net | 1551.57 |
| 10110 | One Hundred and One Dalmatians | 1545.63 |
| 7309 | The Empire Strikes Back | 1377.00 |
| 10398 | The Jungle Book | 1319.40 |
| 10223 | Jurassic Park | 1293.76 |
| 0 | Jurassic World | 1254.45 |
| 7269 | The Godfather | 1246.62 |
| 4361 | The Avengers | 1234.25 |
| 4949 | The Lord of the Rings: The Return of the King | 1214.86 |
| 4 | Furious 7 | 1210.95 |
| 7987 | Return of the Jedi | 1183.00 |
| 3374 | Harry Potter and the Deathly Hallows: Part 2 | 1166.01 |
| 4180 | The Lion King | 1093.39 |
The most profitable movie in all time is Star Wars, the second place is Avatar and Titanic in third place.
We are going to use all the data
sns.set_context("paper", rc={"font.size":15,"axes.titlesize":20,"axes.labelsize":15})
g = sns.regplot(x="budget_adj", y="revenue_adj", data= movies, scatter_kws={'alpha':0.4})
g.set_xlabel('Adjusted Budget')
g.set_ylabel('Adjusted Revenue')
g.set(title='Adjusted Budget vs Adjusted Revenue (millions $)')
plt.gcf().set_size_inches(10, 7);
movies.groupby('release_year').agg({'original_title': 'count'}).reset_index()
| release_year | original_title | |
|---|---|---|
| 0 | 1960 | 32 |
| 1 | 1961 | 30 |
| 2 | 1962 | 30 |
| 3 | 1963 | 34 |
| 4 | 1964 | 41 |
| 5 | 1965 | 34 |
| 6 | 1966 | 44 |
| 7 | 1967 | 36 |
| 8 | 1968 | 37 |
| 9 | 1969 | 29 |
| 10 | 1970 | 40 |
| 11 | 1971 | 55 |
| 12 | 1972 | 37 |
| 13 | 1973 | 55 |
| 14 | 1974 | 44 |
| 15 | 1975 | 42 |
| 16 | 1976 | 46 |
| 17 | 1977 | 55 |
| 18 | 1978 | 61 |
| 19 | 1979 | 56 |
| 20 | 1980 | 73 |
| 21 | 1981 | 78 |
| 22 | 1982 | 80 |
| 23 | 1983 | 77 |
| 24 | 1984 | 99 |
| 25 | 1985 | 106 |
| 26 | 1986 | 109 |
| 27 | 1987 | 113 |
| 28 | 1988 | 132 |
| 29 | 1989 | 128 |
| 30 | 1990 | 126 |
| 31 | 1991 | 125 |
| 32 | 1992 | 126 |
| 33 | 1993 | 172 |
| 34 | 1994 | 174 |
| 35 | 1995 | 170 |
| 36 | 1996 | 196 |
| 37 | 1997 | 180 |
| 38 | 1998 | 191 |
| 39 | 1999 | 212 |
| 40 | 2000 | 196 |
| 41 | 2001 | 212 |
| 42 | 2002 | 241 |
| 43 | 2003 | 254 |
| 44 | 2004 | 253 |
| 45 | 2005 | 323 |
| 46 | 2006 | 350 |
| 47 | 2007 | 389 |
| 48 | 2008 | 441 |
| 49 | 2009 | 474 |
| 50 | 2010 | 420 |
| 51 | 2011 | 456 |
| 52 | 2012 | 498 |
| 53 | 2013 | 567 |
| 54 | 2014 | 635 |
| 55 | 2015 | 558 |
# Lets make a plot with all the movies by year
sns.set_context("paper", rc={"font.size":30,"axes.titlesize":20,"axes.labelsize":15})
sns.set_style("darkgrid")
g =sns.relplot(x="release_year", y="original_title", kind = 'line',
data=movies.groupby('release_year').agg({'original_title': 'count'}).reset_index())
g.set_axis_labels("Year", "movies")
g.set(title='Total Movies Over Time')
plt.gcf().set_size_inches(20, 7)
## Top 20 Director by total profits
movies.groupby('director').agg({'vote_count': 'sum', 'vote_average': 'mean',
'profit': 'sum', 'popularity': 'mean'}).sort_values(by = 'profit', ascending=False).head(20)
| vote_count | vote_average | profit | popularity | |
|---|---|---|---|---|
| director | ||||
| Steven Spielberg | 36735 | 6.824138 | 13050.64 | 1.875138 |
| James Cameron | 21770 | 6.844444 | 6296.23 | 3.202686 |
| George Lucas | 12699 | 6.700000 | 5844.15 | 3.910306 |
| Peter Jackson | 31796 | 7.000000 | 5644.11 | 3.774454 |
| Chris Columbus | 15169 | 6.400000 | 4353.34 | 2.251037 |
| Robert Zemeckis | 21532 | 6.705882 | 4326.95 | 1.977627 |
| Michael Bay | 17477 | 6.327273 | 3958.66 | 2.023170 |
| David Yates | 13819 | 7.400000 | 3394.44 | 5.126634 |
| Tim Burton | 17309 | 6.661111 | 3200.47 | 1.923092 |
| Christopher Nolan | 41712 | 7.577778 | 3149.55 | 6.865185 |
| Ron Howard | 10759 | 6.377273 | 3051.14 | 1.295499 |
| Roland Emmerich | 11018 | 5.872727 | 2936.11 | 1.660508 |
| J.J. Abrams | 14962 | 7.020000 | 2689.65 | 4.800957 |
| Gore Verbinski | 14371 | 6.366667 | 2656.46 | 2.452819 |
| Clint Eastwood | 12103 | 6.508824 | 2636.19 | 0.830037 |
| Guy Hamilton | 1669 | 6.190000 | 2608.28 | 1.059926 |
| Sam Raimi | 13224 | 6.357143 | 2525.86 | 1.160712 |
| Ridley Scott | 24801 | 6.478261 | 2479.41 | 2.009294 |
| Richard Donner | 5627 | 6.306250 | 2383.92 | 1.111420 |
| Francis Lawrence | 14859 | 6.750000 | 2377.72 | 4.437604 |
Steven Spielberg is the most profitable director over time
sns.set_context("paper", rc={"font.size":15,"axes.titlesize":20,"axes.labelsize":15})
sns.set_style("darkgrid")
g =sns.catplot(x="director", y="profit", kind="bar", color= '#abc9ea',
data=movies.groupby('director').agg({'vote_count': 'sum', 'vote_average': 'mean',
'profit': 'sum', 'popularity': 'mean'}).sort_values(by = 'profit',
ascending=False).reset_index().head(20))
g.set_axis_labels('Director', "Profit (US millions)")
g.set_xticklabels(rotation=30)
g.set(title='Top 20 profitable Directors')
plt.gcf().set_size_inches(20, 7)
# To anwser that question we are going to consider directors with more than ten movies. We do that
# trying to be consisten with the data, because there are some director with few movies and the mean
# can be affected for that.
direct_popularity =movies.groupby('director').agg({ 'original_title': 'count',
'popularity': 'mean'}).sort_values(by = 'popularity',
ascending=False).reset_index()
direct_popularity =direct_popularity[direct_popularity['original_title'] >=10].sort_values(by = 'popularity',
ascending=False).reset_index(drop=True)
direct_popularity.head(10)
| director | original_title | popularity | |
|---|---|---|---|
| 0 | Quentin Tarantino | 10 | 4.599180 |
| 1 | Peter Jackson | 13 | 3.774454 |
| 2 | David Fincher | 10 | 3.447978 |
| 3 | Chris Columbus | 14 | 2.251037 |
| 4 | Michael Bay | 11 | 2.023170 |
| 5 | Ridley Scott | 23 | 2.009294 |
| 6 | Robert Zemeckis | 17 | 1.977627 |
| 7 | Tim Burton | 18 | 1.923092 |
| 8 | Antoine Fuqua | 10 | 1.882736 |
| 9 | Steven Spielberg | 29 | 1.875138 |
sns.set_context("paper", rc={"font.size":15,"axes.titlesize":20,"axes.labelsize":15})
sns.set_style("darkgrid")
g =sns.catplot(x="director", y="popularity", kind="bar", color= '#abc9ea',
data=direct_popularity.sort_values(by = 'popularity', ascending=False).reset_index().head(10))
g.set_axis_labels('Director', 'Popularity')
g.set_xticklabels(rotation=30)
g.set(title='Top 10 Directors by Popularity (at least 10 movies)')
plt.gcf().set_size_inches(20, 7)
movies.groupby('director').agg({'original_title': 'count'}).sort_values(by = 'original_title',
ascending=False).reset_index().head(10)
| director | original_title | |
|---|---|---|
| 0 | Woody Allen | 42 |
| 1 | Clint Eastwood | 34 |
| 2 | Steven Spielberg | 29 |
| 3 | Martin Scorsese | 26 |
| 4 | Ridley Scott | 23 |
| 5 | Ron Howard | 22 |
| 6 | Steven Soderbergh | 22 |
| 7 | Joel Schumacher | 21 |
| 8 | Brian De Palma | 20 |
| 9 | Wes Craven | 19 |
def split_list(data, col):
# split the string data in each cel of the col into a list
# and replace the column values with the list
data[col]=data[col].apply(lambda x: x.split('|'))
# We used Pandas explode() to separate list elements into separate rows()
data_split = data.explode(col)
return data_split
df_actor = split_list(movies, 'cast')
df_genre = split_list(movies, 'genres')
df_company = split_list(movies, 'production_companies')
print(df_company.shape)
print(df_genre.shape)
print(df_actor.shape)
(23143, 13) (24650, 13) (48020, 13)
We have created 3 new data frame with more row tahn movies data and with the same number of columns.
df_company.head(3)
| popularity | original_title | cast | director | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | profit | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32.985763 | Jurassic World | [Chris Pratt, Bryce Dallas Howard, Irrfan Khan... | Colin Trevorrow | [Action, Adventure, Science Fiction, Thriller] | Universal Studios | 2015-06-09 | 5562 | 6.5 | 2015 | 138.0 | 1392.45 | 1254.45 |
| 0 | 32.985763 | Jurassic World | [Chris Pratt, Bryce Dallas Howard, Irrfan Khan... | Colin Trevorrow | [Action, Adventure, Science Fiction, Thriller] | Amblin Entertainment | 2015-06-09 | 5562 | 6.5 | 2015 | 138.0 | 1392.45 | 1254.45 |
| 0 | 32.985763 | Jurassic World | [Chris Pratt, Bryce Dallas Howard, Irrfan Khan... | Colin Trevorrow | [Action, Adventure, Science Fiction, Thriller] | Legendary Pictures | 2015-06-09 | 5562 | 6.5 | 2015 | 138.0 | 1392.45 | 1254.45 |
df_actor.head(3)
| popularity | original_title | cast | director | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | profit | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32.985763 | Jurassic World | Chris Pratt | Colin Trevorrow | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 2015-06-09 | 5562 | 6.5 | 2015 | 138.0 | 1392.45 | 1254.45 |
| 0 | 32.985763 | Jurassic World | Bryce Dallas Howard | Colin Trevorrow | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 2015-06-09 | 5562 | 6.5 | 2015 | 138.0 | 1392.45 | 1254.45 |
| 0 | 32.985763 | Jurassic World | Irrfan Khan | Colin Trevorrow | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 2015-06-09 | 5562 | 6.5 | 2015 | 138.0 | 1392.45 | 1254.45 |
df_genre.head(3)
| popularity | original_title | cast | director | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | profit | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32.985763 | Jurassic World | [Chris Pratt, Bryce Dallas Howard, Irrfan Khan... | Colin Trevorrow | Action | Universal Studios|Amblin Entertainment|Legenda... | 2015-06-09 | 5562 | 6.5 | 2015 | 138.0 | 1392.45 | 1254.45 |
| 0 | 32.985763 | Jurassic World | [Chris Pratt, Bryce Dallas Howard, Irrfan Khan... | Colin Trevorrow | Adventure | Universal Studios|Amblin Entertainment|Legenda... | 2015-06-09 | 5562 | 6.5 | 2015 | 138.0 | 1392.45 | 1254.45 |
| 0 | 32.985763 | Jurassic World | [Chris Pratt, Bryce Dallas Howard, Irrfan Khan... | Colin Trevorrow | Science Fiction | Universal Studios|Amblin Entertainment|Legenda... | 2015-06-09 | 5562 | 6.5 | 2015 | 138.0 | 1392.45 | 1254.45 |
We must take into account that our analysis is partial, since a film can have several genres; for example, 'Jurassic World' has four genres. 'Action', 'Adventure', 'Science Fiction', 'Thriller'. What percentage of each is involved in the film? we really don't know. So when we calculate the number of movies that have a given genre, all the genres involved in that movie will count as one movie made. It also happens when we take the average popularity since all the genres of a specific movie are going to have the same popularity. So this is a limitation of this work.
# We are going to take the mean of the popularity, grouping by genres and year, and order
# by year and popularity that give us the popularity of each genre by year.
genre_max =df_genre.groupby(['genres', 'release_year']).agg({'original_title':'count', 'popularity':
'mean'}).sort_values( by = ['release_year', 'popularity'],
ascending=False).reset_index()
genre_max.head(20)
| genres | release_year | original_title | popularity | |
|---|---|---|---|---|
| 0 | Adventure | 2015 | 66 | 3.417373 |
| 1 | Western | 2015 | 6 | 3.178796 |
| 2 | Science Fiction | 2015 | 82 | 2.348382 |
| 3 | Action | 2015 | 103 | 2.332647 |
| 4 | Crime | 2015 | 47 | 1.845816 |
| 5 | Fantasy | 2015 | 32 | 1.790357 |
| 6 | Thriller | 2015 | 160 | 1.486609 |
| 7 | Family | 2015 | 41 | 1.447722 |
| 8 | War | 2015 | 8 | 1.438548 |
| 9 | Mystery | 2015 | 39 | 1.294805 |
| 10 | Animation | 2015 | 37 | 1.195554 |
| 11 | Romance | 2015 | 45 | 1.158760 |
| 12 | History | 2015 | 13 | 1.136360 |
| 13 | Comedy | 2015 | 143 | 1.040572 |
| 14 | Drama | 2015 | 233 | 1.026397 |
| 15 | Music | 2015 | 31 | 0.660579 |
| 16 | Horror | 2015 | 115 | 0.503531 |
| 17 | TV Movie | 2015 | 13 | 0.280440 |
| 18 | Documentary | 2015 | 38 | 0.215164 |
| 19 | Science Fiction | 2014 | 57 | 2.486967 |
genre_max.tail(20)
| genres | release_year | original_title | popularity | |
|---|---|---|---|---|
| 1022 | Horror | 1961 | 3 | 0.254578 |
| 1023 | Western | 1961 | 3 | 0.210021 |
| 1024 | Fantasy | 1961 | 1 | 0.113651 |
| 1025 | Foreign | 1961 | 1 | 0.113651 |
| 1026 | Thriller | 1960 | 6 | 0.811910 |
| 1027 | Adventure | 1960 | 5 | 0.700981 |
| 1028 | Horror | 1960 | 7 | 0.591450 |
| 1029 | Action | 1960 | 8 | 0.590724 |
| 1030 | Western | 1960 | 6 | 0.567759 |
| 1031 | Drama | 1960 | 13 | 0.566305 |
| 1032 | Romance | 1960 | 6 | 0.541227 |
| 1033 | Fantasy | 1960 | 2 | 0.428247 |
| 1034 | Music | 1960 | 1 | 0.423531 |
| 1035 | History | 1960 | 5 | 0.412364 |
| 1036 | Comedy | 1960 | 8 | 0.396000 |
| 1037 | Crime | 1960 | 2 | 0.346479 |
| 1038 | Science Fiction | 1960 | 3 | 0.327905 |
| 1039 | Family | 1960 | 3 | 0.278064 |
| 1040 | War | 1960 | 2 | 0.225863 |
| 1041 | Foreign | 1960 | 1 | 0.194948 |
# Now We get the index of the genre by year with the maximun value
id_genres = genre_max.groupby(['release_year'])['popularity'].transform(max) == genre_max['popularity']
# We select the row by the index a order by year
popular_gender_year = genre_max[id_genres].sort_values(by = 'release_year', ascending = True)
popular_gender_year
| genres | release_year | original_title | popularity | |
|---|---|---|---|---|
| 1026 | Thriller | 1960 | 6 | 0.811910 |
| 1009 | Animation | 1961 | 1 | 2.631987 |
| 992 | Adventure | 1962 | 7 | 0.942513 |
| 975 | Animation | 1963 | 1 | 2.180410 |
| 957 | War | 1964 | 3 | 0.930959 |
| 940 | Music | 1965 | 2 | 0.968850 |
| 922 | Animation | 1966 | 2 | 0.751857 |
| 905 | Animation | 1967 | 1 | 2.550704 |
| 888 | Mystery | 1968 | 4 | 1.519456 |
| 870 | Crime | 1969 | 3 | 0.948020 |
| 850 | Animation | 1970 | 2 | 1.127719 |
| 833 | Fantasy | 1971 | 2 | 1.530722 |
| 832 | Family | 1971 | 2 | 1.530722 |
| 817 | Crime | 1972 | 6 | 1.226152 |
| 799 | Animation | 1973 | 3 | 0.956526 |
| 780 | Mystery | 1974 | 5 | 0.702035 |
| 762 | Adventure | 1975 | 8 | 0.880297 |
| 743 | Crime | 1976 | 8 | 0.707249 |
| 725 | Science Fiction | 1977 | 10 | 1.508652 |
| 705 | Music | 1978 | 3 | 0.679805 |
| 689 | Action | 1979 | 8 | 1.410014 |
| 671 | Science Fiction | 1980 | 11 | 0.897143 |
| 653 | Adventure | 1981 | 12 | 0.981504 |
| 633 | War | 1982 | 2 | 1.143182 |
| 614 | Adventure | 1983 | 14 | 0.900596 |
| 597 | Family | 1984 | 9 | 0.823924 |
| 580 | Family | 1985 | 11 | 0.976498 |
| 563 | Animation | 1986 | 5 | 0.852705 |
| 545 | History | 1987 | 3 | 0.815643 |
| 526 | Animation | 1988 | 4 | 0.908417 |
| 507 | Animation | 1989 | 3 | 1.683815 |
| 488 | Adventure | 1990 | 22 | 0.824644 |
| 469 | Animation | 1991 | 3 | 1.665002 |
| 451 | Animation | 1992 | 4 | 1.286893 |
| 433 | Fantasy | 1993 | 13 | 0.918601 |
| 413 | Crime | 1994 | 25 | 1.344236 |
| 394 | Animation | 1995 | 6 | 1.681856 |
| 375 | Crime | 1996 | 19 | 0.976838 |
| 356 | Animation | 1997 | 8 | 1.174855 |
| 336 | War | 1998 | 3 | 1.246619 |
| 316 | Adventure | 1999 | 17 | 1.167837 |
| 296 | War | 2000 | 4 | 0.909605 |
| 276 | Fantasy | 2001 | 21 | 1.831481 |
| 256 | Fantasy | 2002 | 17 | 1.603498 |
| 236 | Fantasy | 2003 | 24 | 1.951103 |
| 216 | Fantasy | 2004 | 17 | 1.432060 |
| 196 | Fantasy | 2005 | 32 | 1.198583 |
| 176 | Fantasy | 2006 | 30 | 1.096642 |
| 156 | Fantasy | 2007 | 45 | 0.985873 |
| 136 | Adventure | 2008 | 58 | 1.066031 |
| 117 | Adventure | 2009 | 67 | 1.207174 |
| 97 | Adventure | 2010 | 57 | 1.393734 |
| 77 | Adventure | 2011 | 56 | 1.210886 |
| 57 | Western | 2012 | 3 | 2.242083 |
| 38 | Science Fiction | 2013 | 51 | 1.443409 |
| 19 | Science Fiction | 2014 | 57 | 2.486967 |
| 0 | Adventure | 2015 | 66 | 3.417373 |
popular_gender_year.groupby('genres')['release_year'].count().sort_values(ascending =False)
genres Animation 13 Adventure 11 Fantasy 9 Crime 5 Science Fiction 4 War 4 Family 3 Music 2 Mystery 2 Action 1 History 1 Thriller 1 Western 1 Name: release_year, dtype: int64
genre10 = genre_max.groupby('release_year').popularity.nlargest(10).reset_index(0)
genre10 = genre_max.loc[genre10.index.values]
genre10.head(20)
| genres | release_year | original_title | popularity | |
|---|---|---|---|---|
| 1026 | Thriller | 1960 | 6 | 0.811910 |
| 1027 | Adventure | 1960 | 5 | 0.700981 |
| 1028 | Horror | 1960 | 7 | 0.591450 |
| 1029 | Action | 1960 | 8 | 0.590724 |
| 1030 | Western | 1960 | 6 | 0.567759 |
| 1031 | Drama | 1960 | 13 | 0.566305 |
| 1032 | Romance | 1960 | 6 | 0.541227 |
| 1033 | Fantasy | 1960 | 2 | 0.428247 |
| 1034 | Music | 1960 | 1 | 0.423531 |
| 1035 | History | 1960 | 5 | 0.412364 |
| 1009 | Animation | 1961 | 1 | 2.631987 |
| 1010 | Family | 1961 | 4 | 0.896028 |
| 1011 | Crime | 1961 | 2 | 0.806519 |
| 1012 | Adventure | 1961 | 6 | 0.767323 |
| 1013 | Mystery | 1961 | 1 | 0.712793 |
| 1014 | Comedy | 1961 | 10 | 0.679838 |
| 1015 | Music | 1961 | 2 | 0.564748 |
| 1016 | War | 1961 | 2 | 0.531184 |
| 1017 | Drama | 1961 | 16 | 0.432233 |
| 1018 | Romance | 1961 | 7 | 0.426036 |
rows, cols = 18, 3
fig, ax = plt.subplots(rows, cols, figsize=(120, 900))
year = 1959
for row in range(rows):
for col in range(cols):
ax[row, col].bar(x ='genres', height = 'popularity',
data= genre10[genre10['release_year'] == year +1])
ax[row, col].tick_params(axis='both', which='major', labelsize=60)
ax[row, col].set_ylabel( 'Popularity mean',fontsize = 65)
ax[row, col].set_title( 'Genres by popularity ' + '(' + str(year + 1) +')', fontsize = 70)
ax[row, col].tick_params(axis='x', labelrotation=30, size=100)
year += 1
plt.show()
df_genre.groupby('genres').agg({'original_title':'count'}).sort_values( by = 'original_title',
ascending=False).reset_index()
| genres | original_title | |
|---|---|---|
| 0 | Drama | 4364 |
| 1 | Comedy | 3433 |
| 2 | Thriller | 2746 |
| 3 | Action | 2235 |
| 4 | Romance | 1570 |
| 5 | Horror | 1526 |
| 6 | Adventure | 1384 |
| 7 | Crime | 1299 |
| 8 | Science Fiction | 1136 |
| 9 | Family | 1095 |
| 10 | Fantasy | 840 |
| 11 | Mystery | 773 |
| 12 | Animation | 617 |
| 13 | Music | 339 |
| 14 | Documentary | 317 |
| 15 | History | 306 |
| 16 | War | 258 |
| 17 | Western | 160 |
| 18 | TV Movie | 132 |
| 19 | Foreign | 120 |
# Lets make a plot with all the genres with the number of movies that each one has
sns.set_context("paper", rc={"font.size":30,"axes.titlesize":20,"axes.labelsize":15})
g =sns.catplot(x="genres", y="original_title", kind="bar", color= '#abc9ea',
data=df_genre.groupby('genres').agg({'original_title':'count'}).sort_values( by = 'original_title',
ascending=False).reset_index())
g.set_axis_labels("genres", "Total Movies")
g.set(title='Most Common Genres in Total Movies Over Time')
plt.gcf().set_size_inches(20, 7)
We have to be aware here that the same movie can be categorized with more the one genre, so we are going to get a plot when the sum of all number of movie of each genre is goin to be greather than the toyal number of movie that year.
#Firts we are going to check how many movies there are by genre and year
genres_by_year = df_genre.groupby(['genres', 'release_year']).agg({'original_title':'count', 'profit': 'sum'}).sort_values( by = 'release_year',
ascending=False).reset_index()
genres_by_year.head(10)
| genres | release_year | original_title | profit | |
|---|---|---|---|---|
| 0 | Western | 2015 | 6 | 402.16 |
| 1 | Crime | 2015 | 47 | 2875.57 |
| 2 | Animation | 2015 | 37 | 3002.85 |
| 3 | Mystery | 2015 | 39 | 388.86 |
| 4 | Music | 2015 | 31 | 517.49 |
| 5 | Action | 2015 | 103 | 8898.88 |
| 6 | Science Fiction | 2015 | 82 | 6592.38 |
| 7 | Documentary | 2015 | 38 | 25.70 |
| 8 | Comedy | 2015 | 143 | 5025.56 |
| 9 | Horror | 2015 | 115 | 622.21 |
# This plot shows the evolution over time of the amout of movies release in each genres
sns.set_style("darkgrid")
g =sns.relplot(x="release_year", y="original_title",
col="genres", col_wrap=4,
height=5, aspect=.75, linewidth=2,
kind="line", data=(genres_by_year))
g.set_axis_labels("Year", "Number of Movies");
genres_period =genres_by_year.set_index(['genres', 'release_year'])[['original_title']]
genres_period
| original_title | ||
|---|---|---|
| genres | release_year | |
| Western | 2015 | 6 |
| Crime | 2015 | 47 |
| Animation | 2015 | 37 |
| Mystery | 2015 | 39 |
| Music | 2015 | 31 |
| ... | ... | ... |
| Romance | 1960 | 6 |
| Science Fiction | 1960 | 3 |
| Thriller | 1960 | 6 |
| War | 1960 | 2 |
| Action | 1960 | 8 |
1042 rows × 1 columns
## Porcentage of each genre by year
genres_porcentage_by_year = genres_period.groupby(level=1).apply(lambda x:
100 * x / float(x.sum()))
genres_porcentage_by_year = genres_porcentage_by_year.sort_values(by = 'release_year',
ascending = True).reset_index()
genres_porcentage_by_year
<ipython-input-111-a4c6ea7bc19e>:2: FutureWarning: Not prepending group keys to the result index of transform-like apply. In the future, the group keys will be included in the index, regardless of whether the applied function returns a like-indexed object. To preserve the previous behavior, use >>> .groupby(..., group_keys=False) To adopt the future behavior and silence this warning, use >>> .groupby(..., group_keys=True) genres_porcentage_by_year = genres_period.groupby(level=1).apply(lambda x:
| genres | release_year | original_title | |
|---|---|---|---|
| 0 | Action | 1960 | 10.256410 |
| 1 | Adventure | 1960 | 6.410256 |
| 2 | Western | 1960 | 7.692308 |
| 3 | Comedy | 1960 | 10.256410 |
| 4 | Crime | 1960 | 2.564103 |
| ... | ... | ... | ... |
| 1037 | Adventure | 2015 | 5.271565 |
| 1038 | Family | 2015 | 3.274760 |
| 1039 | War | 2015 | 0.638978 |
| 1040 | Horror | 2015 | 9.185304 |
| 1041 | Western | 2015 | 0.479233 |
1042 rows × 3 columns
# This plot shows the evolution over time of the amout of movies release in each genres in porcentage
sns.set_style("darkgrid")
g =sns.relplot(x="release_year", y="original_title",
col="genres", col_wrap=4,
height=5, aspect=.75, linewidth=2,
kind="line", data=(genres_porcentage_by_year))
g.set_axis_labels("year", "% movies over total year");
We are going to see which is in average the most profitable genre over time
top20_genre = df_genre.groupby('genres').agg({'profit': 'mean'}).sort_values(by = 'profit',
ascending =False).reset_index()
top20_genre
| genres | profit | |
|---|---|---|
| 0 | Adventure | 103.869812 |
| 1 | Fantasy | 80.468131 |
| 2 | Family | 68.355863 |
| 3 | Animation | 67.920470 |
| 4 | Action | 62.803414 |
| 5 | Science Fiction | 62.140070 |
| 6 | War | 43.315233 |
| 7 | Music | 40.616195 |
| 8 | Thriller | 36.696413 |
| 9 | Crime | 36.659353 |
| 10 | Romance | 35.488917 |
| 11 | Comedy | 35.342304 |
| 12 | Mystery | 32.906559 |
| 13 | Drama | 27.908902 |
| 14 | History | 23.742484 |
| 15 | Western | 22.278562 |
| 16 | Horror | 17.528670 |
| 17 | Documentary | 2.610694 |
| 18 | TV Movie | 0.067803 |
| 19 | Foreign | -1.576250 |
# Lets make a plot with all the genres with the number of movies that each one has
sns.set_context("paper", rc={"font.size":15,"axes.titlesize":20,"axes.labelsize":15})
sns.set_style("darkgrid")
g =sns.catplot(x="genres", y="profit", kind="bar", color= '#abc9ea', data=top20_genre)
g.set_axis_labels("Genre", "Total Profit (US $)")
g.set_xticklabels(rotation=30)
g.set(title='Top 20 Profitable Genre by Movie Over Time')
plt.gcf().set_size_inches(20, 7)
# Let's group by name of company and calculate the sum of all profits, number of movies, sum the budget of
# all movies, sum of all revenues, mean of vote_average and so on.
#
companies_metric = df_company.groupby('production_companies').agg({'profit':'sum',
'original_title': 'count', 'budget_adj': 'sum',
'revenue_adj': 'sum','vote_average': 'mean','vote_count': 'sum',
'popularity': 'mean'}).reset_index()
companies_metric.head()
| production_companies | profit | original_title | budget_adj | revenue_adj | vote_average | vote_count | popularity | |
|---|---|---|---|---|---|---|---|---|
| 0 | "DIA" Productions GmbH & Co. KG | 9.90 | 1 | 42.670000 | 52.570000 | 5.7 | 152 | 1.626534 |
| 1 | # Lexyn Productions | 0.00 | 1 | 37.336967 | 116.786792 | 5.6 | 15 | 0.468552 |
| 2 | 10 West Studios | 0.00 | 1 | 37.336967 | 116.786792 | 7.4 | 30 | 0.245948 |
| 3 | 100 Feet Productions | -10.13 | 1 | 10.130000 | 116.786792 | 5.8 | 35 | 0.595424 |
| 4 | 1019 Entertainment | 0.74 | 1 | 1.840000 | 2.580000 | 5.7 | 55 | 0.336107 |
companies_metric['profit/movie'] = companies_metric['profit'].div(companies_metric.original_title).round(2)
print(companies_metric.shape)
print(companies_metric.production_companies.nunique())
companies_metric.head()
(7842, 9) 7842
| production_companies | profit | original_title | budget_adj | revenue_adj | vote_average | vote_count | popularity | profit/movie | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | "DIA" Productions GmbH & Co. KG | 9.90 | 1 | 42.670000 | 52.570000 | 5.7 | 152 | 1.626534 | 9.90 |
| 1 | # Lexyn Productions | 0.00 | 1 | 37.336967 | 116.786792 | 5.6 | 15 | 0.468552 | 0.00 |
| 2 | 10 West Studios | 0.00 | 1 | 37.336967 | 116.786792 | 7.4 | 30 | 0.245948 | 0.00 |
| 3 | 100 Feet Productions | -10.13 | 1 | 10.130000 | 116.786792 | 5.8 | 35 | 0.595424 | -10.13 |
| 4 | 1019 Entertainment | 0.74 | 1 | 1.840000 | 2.580000 | 5.7 | 55 | 0.336107 | 0.74 |
We have the same number of rows than the number of unique companies
We can see at the same time how many movies they did to get the total profits.
# We are going to get a table with the 20 most profitable companies and also see how many movies they
# did to get that profit as weel the profit/movie.
top_companies =companies_metric[['production_companies','profit',
'original_title', 'profit/movie']].sort_values( by = 'profit',
ascending = False).reset_index()
top_companies.head(20)
| index | production_companies | profit | original_title | profit/movie | |
|---|---|---|---|---|---|
| 0 | 7546 | Warner Bros. | 46767.28 | 509 | 91.88 |
| 1 | 7338 | Universal Pictures | 45271.39 | 522 | 86.73 |
| 2 | 5281 | Paramount Pictures | 40324.24 | 431 | 93.56 |
| 3 | 7223 | Twentieth Century Fox Film Corporation | 39142.90 | 282 | 138.80 |
| 4 | 7531 | Walt Disney Pictures | 25638.62 | 213 | 120.37 |
| 5 | 1593 | Columbia Pictures | 20475.57 | 272 | 75.28 |
| 6 | 341 | Amblin Entertainment | 17004.63 | 65 | 261.61 |
| 7 | 4892 | New Line Cinema | 15874.04 | 219 | 72.48 |
| 8 | 4539 | Metro-Goldwyn-Mayer (MGM) | 14285.91 | 218 | 65.53 |
| 9 | 4170 | Lucasfilm | 13094.75 | 20 | 654.74 |
| 10 | 2060 | DreamWorks SKG | 11646.31 | 88 | 132.34 |
| 11 | 1594 | Columbia Pictures Corporation | 11568.87 | 160 | 72.31 |
| 12 | 7133 | Touchstone Pictures | 10873.46 | 178 | 61.09 |
| 13 | 2269 | Eon Productions | 10477.72 | 22 | 476.26 |
| 14 | 2057 | DreamWorks Animation | 8685.86 | 46 | 188.82 |
| 15 | 3196 | Heyday Films | 8409.45 | 18 | 467.19 |
| 16 | 7444 | Village Roadshow Pictures | 8359.23 | 88 | 94.99 |
| 17 | 2082 | Dune Entertainment | 8168.99 | 55 | 148.53 |
| 18 | 5436 | Pixar Animation Studios | 8010.47 | 44 | 182.06 |
| 19 | 5800 | Relativity Media | 7925.31 | 108 | 73.38 |
# Let's get the corralation among profit, numbers o movies a profit by movie
companies_metric[['production_companies','profit',
'original_title', 'profit/movie']].sort_values( by = 'profit', ascending = False).reset_index().corr()
<ipython-input-119-a58a40208365>:2: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning. companies_metric[['production_companies','profit',
| index | profit | original_title | profit/movie | |
|---|---|---|---|---|
| index | 1.000000 | 0.031844 | 0.031531 | 0.001976 |
| profit | 0.031844 | 1.000000 | 0.910943 | 0.207911 |
| original_title | 0.031531 | 0.910943 | 1.000000 | 0.065098 |
| profit/movie | 0.001976 | 0.207911 | 0.065098 | 1.000000 |
# Lets make a plot that shows the top 10 profitable companies over time
sns.set_context("paper", rc={"font.size":15,"axes.titlesize":20,"axes.labelsize":15})
sns.set_style("darkgrid")
g =sns.catplot(x="production_companies", y="profit", kind="bar", color= '#abc9ea', data=top_companies.head(20))
g.set_axis_labels("Company", "Total Profit (US $)")
g.set_xticklabels(rotation=30)
g.set(title='Top 20 Profitable Companies Over Time')
plt.gcf().set_size_inches(20, 7)
The idea is to see in a plot this relation between those variabels
sns.set_context("paper", rc={"font.size":15,"axes.titlesize":20,"axes.labelsize":15})
g = sns.regplot(x="original_title", y="profit", data= top_companies, scatter_kws={'alpha':0.4})
g.set_xlabel('Number of Movies')
g.set_ylabel( 'Total Profit (millions $)')
g.set(title='Total Number of Movies vs Total Profit by Company')
plt.gcf().set_size_inches(10, 7);
we are going to choose companies with at least 5 movies to see how much return by movie those companies get.
profit_by_movie =companies_metric[['production_companies','original_title', 'profit',
'profit/movie', 'vote_average']].sort_values( by = 'profit/movie', ascending =False)
profit_by_movie.rename(columns={'original_title': 'number_movies'}, inplace =True)
profit_by_movie[profit_by_movie['number_movies'] >= 5].head(30)
| production_companies | number_movies | profit | profit/movie | vote_average | |
|---|---|---|---|---|---|
| 4053 | Lightstorm Entertainment | 7 | 5847.47 | 835.35 | 6.814286 |
| 4170 | Lucasfilm | 20 | 13094.75 | 654.74 | 6.690000 |
| 1803 | Danjaq | 6 | 3177.70 | 529.62 | 6.300000 |
| 2269 | Eon Productions | 22 | 10477.72 | 476.26 | 6.186364 |
| 3196 | Heyday Films | 18 | 8409.45 | 467.19 | 7.066667 |
| 7669 | WingNut Films | 15 | 6062.44 | 404.16 | 6.980000 |
| 6721 | Syncopy | 9 | 3441.20 | 382.36 | 7.344444 |
| 5195 | Pacific Data Images (PDI) | 7 | 2635.18 | 376.45 | 6.228571 |
| 3943 | Laura Ziskin Productions | 5 | 1869.98 | 374.00 | 6.160000 |
| 7770 | Zanuck/Brown Productions | 6 | 2196.37 | 366.06 | 6.283333 |
| 704 | Bad Robot | 10 | 3622.57 | 362.26 | 6.830000 |
| 7222 | Twentieth Century Fox Animation | 12 | 4088.13 | 340.68 | 6.550000 |
| 1586 | Color Force | 8 | 2454.37 | 306.80 | 6.637500 |
| 1906 | Dentsu | 13 | 3817.94 | 293.69 | 6.307692 |
| 5101 | One Race Productions | 5 | 1425.91 | 285.18 | 6.500000 |
| 6685 | Sunswept Entertainment | 12 | 3353.21 | 279.43 | 6.050000 |
| 1339 | Centropolis Entertainment | 10 | 2779.83 | 277.98 | 5.940000 |
| 4401 | Marvel Entertainment | 6 | 1662.26 | 277.04 | 6.000000 |
| 5948 | Roth Films | 7 | 1907.15 | 272.45 | 6.071429 |
| 3395 | Imprint Entertainment | 6 | 1583.62 | 263.94 | 5.483333 |
| 341 | Amblin Entertainment | 65 | 17004.63 | 261.61 | 6.409231 |
| 6848 | Temple Hill Entertainment | 12 | 3105.06 | 258.76 | 6.241667 |
| 958 | Blue Sky Studios | 16 | 4037.86 | 252.37 | 6.581250 |
| 4433 | Maverick Films | 7 | 1697.50 | 242.50 | 5.400000 |
| 4738 | Moving Picture Company (MPC) | 11 | 2651.55 | 241.05 | 6.100000 |
| 11 | 1492 Pictures | 19 | 4464.12 | 234.95 | 6.089474 |
| 4405 | Marvel Studios | 27 | 6342.03 | 234.89 | 6.759259 |
| 6329 | Silver Screen Partners IV | 8 | 1864.68 | 233.08 | 6.675000 |
| 3625 | Jerry Bruckheimer Films | 25 | 5823.94 | 232.96 | 6.252000 |
| 4400 | Marvel Enterprises | 16 | 3696.75 | 231.05 | 5.700000 |
sns.set_context("paper", rc={"font.size":15,"axes.titlesize":20,"axes.labelsize":15, 'legend.fontsize':15})
g = sns.relplot(x="profit/movie", y="number_movies", size="profit",
sizes=(70, 300),alpha =0.95, data=profit_by_movie[profit_by_movie['profit'] >5])
g.set_xlabels('Profit by Movie')
g.set_ylabels('Numeber of Movies')
plt.text(x= 500, y=600, s='Profit by Movie vs Total Number of Movies', fontsize=20, weight='bold')
plt.text(x= 500, y=570, s='The size of each point corresponds to total profit in millions', fontsize=14, alpha=0.75)
plt.gcf().set_size_inches(15, 7);
# Get the firts 20 actors with more movies
df_actor.groupby('cast').agg({'original_title':'count','vote_average': 'mean',
'profit': 'sum'}).sort_values(by = 'original_title', ascending=False).head(20)
| original_title | vote_average | profit | |
|---|---|---|---|
| cast | |||
| Robert De Niro | 72 | 6.330556 | 3743.08 |
| Samuel L. Jackson | 70 | 6.101429 | 5111.11 |
| Bruce Willis | 62 | 5.903226 | 4900.54 |
| Nicolas Cage | 61 | 5.777049 | 2847.80 |
| Michael Caine | 52 | 6.353846 | 3799.98 |
| Robin Williams | 51 | 6.043137 | 5430.50 |
| Morgan Freeman | 49 | 6.383673 | 3329.17 |
| John Goodman | 48 | 6.177083 | 2820.92 |
| John Cusack | 48 | 6.077083 | 1383.98 |
| Susan Sarandon | 47 | 6.044681 | 1044.57 |
| Liam Neeson | 47 | 6.159574 | 4340.60 |
| Alec Baldwin | 47 | 5.942553 | 900.46 |
| Julianne Moore | 46 | 6.193478 | 1574.17 |
| Tom Hanks | 46 | 6.582609 | 8316.47 |
| Dennis Quaid | 45 | 6.008889 | 1276.69 |
| Gene Hackman | 45 | 6.264444 | 3604.37 |
| Willem Dafoe | 44 | 6.093182 | 2911.56 |
| Sylvester Stallone | 44 | 5.968182 | 4805.37 |
| Meryl Streep | 44 | 6.384091 | 2688.69 |
| Clint Eastwood | 43 | 6.369767 | 3192.00 |
# Get the top 20 profitable actors
df_actor.groupby('cast').agg({'original_title':'count','vote_average': 'mean',
'profit': 'sum'}).sort_values(by = 'profit', ascending=False).head(20)
| original_title | vote_average | profit | |
|---|---|---|---|
| cast | |||
| Harrison Ford | 42 | 6.347619 | 12123.16 |
| Tom Cruise | 37 | 6.410811 | 8399.51 |
| Tom Hanks | 46 | 6.582609 | 8316.47 |
| Carrie Fisher | 11 | 6.627273 | 7270.69 |
| Emma Watson | 14 | 6.728571 | 7228.30 |
| Daniel Radcliffe | 16 | 6.875000 | 7070.93 |
| Mark Hamill | 16 | 6.587500 | 7044.58 |
| Rupert Grint | 14 | 6.850000 | 6974.65 |
| Ian McKellen | 21 | 6.576190 | 6596.22 |
| Sean Connery | 40 | 6.285000 | 6121.58 |
| Orlando Bloom | 16 | 6.387500 | 5941.94 |
| Cameron Diaz | 38 | 6.039474 | 5867.76 |
| Eddie Murphy | 42 | 5.702381 | 5789.11 |
| Ralph Fiennes | 34 | 6.455882 | 5646.29 |
| Robert Downey Jr. | 41 | 6.195122 | 5528.12 |
| Leonardo DiCaprio | 30 | 6.733333 | 5513.11 |
| Will Smith | 22 | 6.345455 | 5447.11 |
| Robin Williams | 51 | 6.043137 | 5430.50 |
| Sandra Bullock | 33 | 6.021212 | 5409.44 |
| Johnny Depp | 43 | 6.458140 | 5284.71 |
To be consistent with the data, we're only going to consider actors with at least 30 movies to get the rank, which gives us a sense of performance over time
top_vote_average =df_actor.groupby('cast').agg({'original_title': 'count','vote_average': 'mean'})
top_vote_average =top_vote_average[top_vote_average['original_title'] >=30].sort_values(by = 'vote_average',
ascending = False).reset_index()
top_vote_average.head(10)
| cast | original_title | vote_average | |
|---|---|---|---|
| 0 | Brad Pitt | 38 | 6.736842 |
| 1 | Leonardo DiCaprio | 30 | 6.733333 |
| 2 | Tom Hanks | 46 | 6.582609 |
| 3 | Christian Bale | 31 | 6.522581 |
| 4 | Scarlett Johansson | 32 | 6.509375 |
| 5 | Denzel Washington | 38 | 6.502632 |
| 6 | Woody Allen | 36 | 6.472222 |
| 7 | Johnny Depp | 43 | 6.458140 |
| 8 | Ralph Fiennes | 34 | 6.455882 |
| 9 | Cate Blanchett | 31 | 6.454839 |
To answer this question we have to make a summary of the analysis carried out.
We have had some limitations in our analysis. First, at data cleanup time, we removed 1,030 rows where production_companies appeared with na's approximately 10% of our data, perhaps it would have been more appropriate to search for that information on the internet. We also found over 5,000 rows where budget and revenue had a zero value (more than 50% of the data), so we decided to take the average of those columns and replace the zero values with the mean. Being such a high percentage of rows, the most logical thing would be to try to search for that data on the internet and replace it with real values since this distorts the analysis, but for the purposes of carrying out this project we decided to implement the average. Second, since a movie can have several genres; for example, 'Jurassic World' has four genres. 'Action', 'Adventure', 'Science Fiction', 'Thriller'. What percentage of each is involved in the film? we really don't know. So when we calculate the number of movies that have a given genre, all the genres involved in that movie will count as one movie made, which is another limitation of our analysis. Something similar happens when we take the average popularity of the genres, since all the genres of a specific movie are going to have the same popularity, so this is another limitation of this work and the same applies to the actors or actresses, since they do not we have an order of importance of them.